This project is an exploratory data analysis of White Wine Quality to determine which features affect wine quality. The dataset was created in 2009 by Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis and contains wine preferences accompanied by their physicochemical properties. The Quality rating output is based on sensory data, where at least 3 evaluations were made by wine experts per wine sample.
The White Wine dataset contains the following:
Unique identifier:
1 - X
Input variables (based on physicochemical tests):
2 - fixed acidity
3 - volatile acidity
4 - citric acid
5 - residual sugar
6 - chlorides
7 - free sulfur dioxide
8 - total sulfur dioxide
9 - density
10 - pH
11 - sulphates
12 - alcohol
Output variable (based on sensory data):
13 - quality (score between 0 and 10)
I will use these values to determine which features have the greatest effect on a wine's quality rating.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## [1] 4898
## [1] 13
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
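The output above has the shape produced by R's `names()`, `nrow()`, `ncol()` and `summary()`. The following is a minimal sketch of those calls, assuming the data sits in a data frame called `wine`. The data frame here is a tiny made-up stand-in and the CSV file name in the comment is hypothetical; neither comes from the original analysis.

```r
# Tiny stand-in for the wine data; these values are made up for illustration.
# With the real file one would instead call something like:
#   wine <- read.csv("wineQualityWhites.csv")   # hypothetical file name
wine <- data.frame(
  X       = 1:5,
  pH      = c(3.1, 3.2, 3.0, 3.3, 3.2),
  alcohol = c(9.0, 10.5, 11.2, 9.8, 12.1),
  quality = c(5L, 6L, 7L, 5L, 8L)
)

names(wine)    # column names
nrow(wine)     # number of observations
ncol(wine)     # number of features
summary(wine)  # min, quartiles, median, mean and max for each column
```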
A quick summary of the data shows that White Wine is acidic, with a maximum pH of 3.82; has a mean alcohol content of about 10.5%; and has a quality rating that usually falls between 5 and 6 (the first and third quartiles) on a scale of 0 to 10.
This bar plot of Wine Quality shows that most ratings fall between 5 and 7, which is consistent with the dataset's description that there are many more normal wines than excellent or poor ones.
## [1] "The alcohol level feature in the White Wine dataset appears to be positively skewed (skewed right)."
## [1] "The distribution is skewed right."
## [1] "pH shows a bell-like shape, so it appears to be normally distributed."
## [1] "Fixed Acidity also appears to be normally distributed."
## [1] "Volatile acidity in large quantities can lead to an unpleasant vinegar taste in wine. I wonder whether wine quality will correlate with this."
## [1] "Distribution appears symmetric with a couple of outliers."
## [1] "The Chlorides distribution appears to be symmetric, non-normal, & short-tailed."
## [1] "Free Sulfur Dioxide has a normal distribution."
## [1] "Total Sulfur Dioxide has a normal distribution."
## [1] "Density has a symmetric, non normal, & short tailed distribution."
## [1] "Sulphates can contribute to levels of sulfur dioxide in wine, so I expected their distributions to be related. However, the Sulphates distribution more closely resembles the Chlorides distribution. The Sulphates show a normal distribution."
The above plots show a smoothed version of a histogram for each input variable. Most of the plots had to be adjusted to display the data more clearly, which was done by setting scale limits on the x and y axes. The density estimates allow for more readable distributions.
The CSV dataset contains 4,898 observations with 13 features: X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, & quality. Two of the variables are integers: X, a unique identifier, and quality, the output rating. The remaining 11 variables are numeric inputs describing physical and chemical properties.
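As a rough illustration of a smoothed histogram with axis limits, here is a minimal base R sketch. It uses simulated right-skewed values rather than the wine data, and the axis limit is an arbitrary choice for the example. The original plots may well have been drawn with a different plotting package.

```r
set.seed(1)
# Simulated right-skewed values standing in for one input variable;
# these are not the real wine measurements.
x <- rexp(500, rate = 1 / 6)

d <- density(x)   # kernel density estimate (Gaussian kernel by default)

pdf(NULL)         # null device so the sketch also runs non-interactively
plot(d,
     xlim = c(0, 25),   # hypothetical limit that crops the long right tail
     main = "Density estimate (simulated data)",
     xlab = "value")
invisible(dev.off())
```

Cropping the x axis this way only changes what is displayed; the density estimate itself is still computed from all of the values.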
Quality is the main feature of interest. Of 11 input variables, I hope to determine which features influence the quality rating.
I predict that fixed.acidity, residual.sugar, & alcohol all contribute to the quality rating.
I did not create a new variable.
No, I did not have to perform any operations on the data. The normal distributions shown worked for what I wanted to view.
## F.A V.A C.A Sug Chl F.S T.S Den pH Sul Alc
## F.A 1.00 -0.02 0.29 0.09 0.02 -0.05 0.09 0.27 -0.43 -0.02 -0.12
## V.A -0.02 1.00 -0.15 0.06 0.07 -0.10 0.09 0.03 -0.03 -0.04 0.07
## C.A 0.29 -0.15 1.00 0.09 0.11 0.09 0.12 0.15 -0.16 0.06 -0.08
## Sug 0.09 0.06 0.09 1.00 0.09 0.30 0.40 0.84 -0.19 -0.03 -0.45
## Chl 0.02 0.07 0.11 0.09 1.00 0.10 0.20 0.26 -0.09 0.02 -0.36
## F.S -0.05 -0.10 0.09 0.30 0.10 1.00 0.62 0.29 0.00 0.06 -0.25
Displaying the first 6 rows of the newly created matrix showing correlations between features (not including X or Quality). I will use this matrix to draw a correlation plot to determine which features are closely related.
This is a correlation plot that shows the correlations of all of the input variables. Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients.
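A correlation matrix like the one printed above can be built with base R's `cor()`. The sketch below is an assumption about the general approach, not the original code: the three columns are simulated stand-ins, and the short column names only mimic the abbreviations in the output.

```r
set.seed(42)
n <- 200
# Simulated stand-ins for three inputs; not the real wine measurements.
sugar   <- rnorm(n, mean = 6, sd = 4)
dens    <- 0.99 + 0.0005 * sugar + rnorm(n, sd = 0.001)
alcohol <- 10.5 - 300 * (dens - 0.994) + rnorm(n, sd = 0.8)

inputs <- data.frame(Sug = sugar, Den = dens, Alc = alcohol)

M <- round(cor(inputs), 2)   # Pearson correlations, rounded to 2 decimals
head(M)                      # first rows of the matrix (at most 6)

# A plotting package such as corrplot could then draw M as colored circles;
# that step is omitted here to keep the sketch free of extra dependencies.
```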
## [1] "Density & Residual Sugar show the highest correlation coefficient of 0.84. This graph displays the positive correlation, showing that as residual sugar content increases, the density also increases."
## [1] "Alcohol & Residual Sugar show a correlation coefficient of -0.45. This plot shows that as Alcohol percentage increases the amount of sugar decreases."
## [1] "The alcohol & density relationship shows a negative correlation, which means as alcohol content increases, the density of the wine decreases."
## [1] "Free.sulfur.dioxide & total.sulfur.dioxide show a positive correlation, so as one value increases the other also increases."
## [1] "Fixed.acidity & pH show a negative correlation, so as pH increases fixed.acidity decreases."
Scatter plots displaying the strongest correlations found within the correlation plot. The information on display is not very useful toward the relationships I want to investigate. However, I can use these plots in the following section to investigate by adding quality as color in Multivariate plots.
Since I predict Alcohol to be one of the greatest determining factors of a wine's quality rating, I decided to plot Alcohol Content vs. Quality Rating in a box plot. This plot shows that the wines with a higher than average alcohol content are also the wines with the highest quality ratings.
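A box plot of one variable grouped by another can be sketched in base R as follows. The data are simulated so that alcohol drifts upward with the rating, purely to make the example readable; the numbers say nothing about the real wines.

```r
set.seed(7)
# Simulated ratings and alcohol values; assumptions for illustration only.
quality <- sample(3:9, 300, replace = TRUE)
alcohol <- 9 + 0.3 * (quality - 3) + rnorm(300, sd = 0.8)

pdf(NULL)   # null device so the sketch also runs non-interactively
boxplot(alcohol ~ quality,
        xlab = "Quality rating",
        ylab = "Alcohol (% by volume)")
invisible(dev.off())

# Median alcohol per rating, the center line of each box
med <- tapply(alcohol, quality, median)
med
```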
I observed correlations between alcohol & residual.sugar and alcohol & density, but the greatest correlation was between sugar & density.
I noticed that there were correlations in alcohol & density, sugar & density and free.sulfur.dioxide & total.sulfur.dioxide. My predictions did not focus on density or sulfur.dioxide as a factor.
The alcohol & density relationship shows a negative correlation, which means as alcohol content increases, the density of the wine decreases. Density also has a positive relationship with total.sulfur.dioxide, which means as total.sulfur.dioxide increases the density of the wine increases.
The strongest relationship appeared to be between sugar & density, with a correlation coefficient of 0.84, followed by alcohol & density at -0.78.
In this section, I chose to redraw the plots from the Bivariate section with an overlay of a purple color palette that represents the Quality rating by color intensity.
The plots of correlated features displayed in this section now show an added layer of color that shows the associated quality rating for each observation. It is hard to judge any trends based on the wide range of quality rating colors, so I created a new value called 'qrating' to display the quality ratings in three categories.
I decided to map quality ratings into three groups: Poor, Average, & Great. The 'Poor' group consists of ratings 3, 4, and 5, which make up 1640/4898 or 33% of the ratings. The 'Average' group contains the 6 rating, for 2198/4898 or 45%. The 'Great' group contains the higher tier of wines rated 7, 8, or 9, which account for 1060/4898 or 22% of the ratings. The dataset states that "the classes are ordered and not balanced (e.g. there are munch more normal wines than excellent or poor ones)." So 'Average' is synonymous with 'Normal' for this analysis, and 'Great' with 'Excellent'.
## Poor Average Great
## 1640 2198 1060
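One way to build a three-level grouping like 'qrating' is base R's `cut()`. This is a sketch under the assumption that the ratings are grouped exactly as described above (3 to 5, 6, and 7 to 9). The short `quality` vector is illustrative only; the group counts are the ones reported in the text.

```r
# Illustrative ratings only; the real column has 4,898 values.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

qrating <- cut(quality,
               breaks = c(2, 5, 6, 9),   # intervals (2,5], (5,6], (6,9]
               labels = c("Poor", "Average", "Great"))
table(qrating)

# Shares quoted in the text, recomputed from the reported group counts
counts <- c(Poor = 1640, Average = 2198, Great = 1060)
round(100 * counts / sum(counts))   # percentages, rounded to whole numbers
```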
## [1] "Density appears to show a correlation to quality, even though residual.sugar does not."
## [1] "Alcohol appears to show a correlation to quality, even though residual.sugar does not."
## [1] "Alcohol & Density both show a correlation to quality, so as alcohol content increases wine quality increases and density decreases."
## [1] "Free.sulfur.dioxide & total.sulfur.dioxide show no clear correlation to wine quality."
This plot displays Fixed Acidity vs. Residual Sugar vs. pH. I wanted to determine whether there were correlations between the remaining features I focused on from the beginning of the project. The only relationship is between fixed.acidity and pH, where acidity increases as the pH value decreases.
The following shows the number of observations in each Density category.
## Low Average High
## 1185 2442 1271
This scatter plot shows another representation of the correlation between Alcohol content, quality rating, and density. I also made use of a green color palette, to differentiate the 'dvalue' plot from previous 'qrating' plots.
The density value is categorized as Low (Values < 0.9917), Average (Values between 0.9917 & 0.9961), & High (values > 0.9961).
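The thresholds above (which match the first and third quartiles of density in the summary) can be applied with nested `ifelse()` calls. This is a sketch of one possible way to build 'dvalue', not necessarily how it was done originally. The `dens` vector holds a few illustrative values chosen to sit on and around the cut points.

```r
# Illustrative density values, including both boundary values
dens <- c(0.9890, 0.9917, 0.9940, 0.9961, 0.9990)

dvalue <- factor(
  ifelse(dens < 0.9917, "Low",
         ifelse(dens > 0.9961, "High", "Average")),
  levels = c("Low", "Average", "High")   # keep the categories in order
)
table(dvalue)
```

Values exactly equal to 0.9917 or 0.9961 fall into 'Average' here, which follows the strict inequalities used for 'Low' and 'High'.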
This representation shows a more useful depiction, which can be interpreted that as the alcohol percentage increases, the level of density falls and vice versa. So both a lower density & higher alcohol content positively affect the quality rating. This plot appears to be the most effective in displaying the strongest correlation, as sugar was found to have no bearing on quality.
It appears that Density affects Quality, but Residual Sugar does not. I also observed that Alcohol and Density both display a strong effect on Quality. Alcohol also individually shows the strongest relationship with Quality.
I was surprised that alcohol content also directly affected the density of the wine. I was also surprised that there were very few strong relationships between the inputs & the quality rating output. The majority of the plots show a lack of consistency, which rules out any strong correlations.
No, I did not.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Quality rating displayed as an output of the wine observation data in the form of a histogram. This bar plot of Wine Quality shows that most ratings fall between 5 and 7, which is consistent with the dataset’s description that there are many more normal wines than excellent or poor ones.
Box plot displaying the range of alcohol (%/vol) for each Quality rating as a number. This plot confirms my analysis that the wines with a higher than average alcohol content are also the wines with the highest quality ratings. Had I created qrating earlier in the project, I could have displayed the quality rating data on a simpler plot by using the three tier rating scale that was utilized for testing. ‘qrating’ allows me to display this finding more clearly.
This scatter plot shows another representation of the correlation between Alcohol content, quality rating, and density. I also made use of a green color palette, to differentiate the ‘dvalue’ plot from previous ‘qrating’ plots.
The density value is categorized as Low (Values < 0.9917), Average (between 0.9917 & 0.9961), & High (values > 0.9961).
This representation shows a more useful depiction, which can be interpreted that as the alcohol percentage increases, the level of density falls and vice versa. So both a lower density & higher alcohol content positively affect the quality rating. With a correlation coefficient of -0.78 for Alcohol/Density, -0.31 Quality/Density & 0.43 Quality/Alcohol, this plot appears to be the most effective in displaying the strongest correlations, as sugar was found to have no bearing on quality.
——
The White Wine dataset contains 4898 observations with 13 features. Of the 13 features there are 11 input features, one output feature, and one unique identifier. The purpose of this Exploratory Data Analysis was to determine which features impact the quality of the wine.
I initially used histograms to display the data, but I was not able to determine any correlations. Scatter plots provided more intuitive visualizations that could be easily decoded. The use of Bivariate Plots allowed me to obtain correlations. These plots helped me determine that residual.sugar & density shared the strongest correlation. I then used their strong correlation to determine whether they affected quality jointly or individually. I discovered that density affected quality, but residual.sugar did not. The second strongest correlation was between alcohol & density, and they jointly affected White Wine quality. The third strongest correlation was between alcohol & residual.sugar; however, my plot again confirmed that residual.sugar did not appear to affect quality. The fourth strongest correlation was between free.sulfur.dioxide & total.sulfur.dioxide, but total.sulfur.dioxide had more impact on quality than free.sulfur.dioxide.
In conclusion, my prediction that the three features fixed.acidity, residual.sugar, & alcohol affect wine quality proved to be incorrect. Of those features, only alcohol was shown to affect wine quality. In addition to alcohol, density was also shown to affect wine quality. Based on these findings, I would use alcohol percentage and density to predict White Wine quality in future investigations.